Portuguese “Vinho Verde” White Wine Exploration by Cam McLeod

Univariate Plots Section

## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The middle 50% of wines have a fixed acidity between 6.3 and 7.3 g/L, a volatile acidity between 0.21 and 0.32 g/L, and a citric acid content between 0.27 and 0.39 g/L. Mean residual sugar is 6.391 g/L. Density falls within a relatively tight range of 0.9871 to 1.039 g/cm^3. The middle 50% of wines have a pH between 3.09 and 3.28. Alcohol percentage varies from 8% to 14.2%. Wine quality varies between 3 and 9, with a median of 6 and mean of 5.878.

## [1] 10.3 10.3 10.7 10.7 11.8 14.2

Fixed acidity appears to have a relatively normal distribution around the mean of 6.855 g/L. There appear to be some outliers at a very high fixed acidity, around 11.8 and 14.2 g/L. I wonder if the high level of fixed acidity affects the wine quality.

##  [1] 0.780 0.785 0.815 0.850 0.905 0.910 0.930 0.965 1.005 1.100

Volatile acidity appears to have a right skewed distribution around the median of 0.26 g/L. There appear to be a number of higher values above 0.70 g/L. Plotting the volatile acidity on a log10 scale, there appears to be a near bimodal distribution with peaks at 0.18 and 0.3 g/L. I wonder if the high level of volatile acidity affects the wine quality, or if the ratio of acidities affects the wine quality. Below I have plotted the ratios of acidities in histograms.

In the bivariate plots section, I will compare the above plots. It appears that citric acid and volatile acidity have common ratios occurring at 1 and 2.

## [1] 1.00 1.00 1.00 1.00 1.23 1.66

Citric acid levels appear to have a relatively normal distribution around the median of 0.32 g/L. There appear to be some outliers with high citric acid levels, around 1.23 and 1.66 g/L. There is also a local mode at 0.49 g/L. This could be a regulated addition for certain types of wines or wineries. It would be interesting to see the quality of the wines at the local mode of 0.49 g/L.

## [1] 23.50 26.05 26.05 31.60 31.60 65.80

Residual sugar levels appear to have a very right skewed distribution with a long tail. The median is 5.2 g/L and the mean is 6.391 g/L. Plotting residual sugar on a log10 scale shows a bimodal distribution with peaks around 2 and 10 g/L. There appear to be some outliers with very high residual sugar levels, at 31.60 and 65.80 g/L.
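As a rough illustration of the log10 binning described above, the sketch below uses only the first ten residual sugar values shown in the `str()` output, so it cannot reproduce the bimodal shape of the full dataset. The break points are an illustrative assumption, not the bin widths used for the plots in this report.

```r
# First ten residual sugar values (g/L), copied from the str() output in this report.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Bin the log10-transformed values; plot = FALSE returns the counts without drawing.
# The break points are an illustrative choice, not the ones used for the plots here.
log.breaks <- seq(0, 1.5, by = 0.25)
h <- hist(log10(residual.sugar), breaks = log.breaks, plot = FALSE)

# Report each bin's upper edge back on the original g/L scale.
log.bins <- data.frame(upper.g.per.L = round(10^h$breaks[-1], 1), count = h$counts)
print(log.bins)
```

Equal-width bins on the log scale cover progressively wider ranges in g/L, which is why a long right tail is compressed and separate low and high sugar groups become easier to see.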

## [1] 0.244 0.255 0.271 0.290 0.301 0.346

The chloride levels show a slight bimodal distribution around 0.036 and 0.046 g/L, with a very long tail. Outliers run up to 0.346 g/L. Can the bimodal distribution be attributed to a correlation with another variable?

Free sulfur dioxide has a near normal distribution centered around 30 mg/L. The histogram for free sulfur dioxide has a proportionally long tail, running to 289 mg/L. My research shows that free SO2 is detectable by sensitive tasters, so it would be interesting to see if there is a threshold range at which the quality suffers.

Total sulfur dioxide has a near normal distribution near the median of 134 mg/L. Below the ratio of free sulfur dioxide to total sulfur dioxide is plotted on a histogram:

Below I have plotted density histograms:

Density appears to fall into some consistent, stepped ranges: high frequencies between 0.991 and 0.994, mid frequencies between 0.994 and 0.996, and lower frequencies between 0.996 and 0.999. Maybe these densities correspond to alcohol percentages.

pH has a relatively normal distribution, centered around the median of 3.18.

Sulphates have a right skewed distribution, with a mean of 0.49 g/L and a median of 0.47 g/L, and might be expected to correlate with the sulfur dioxide levels.

Alcohol percentage varies between 8% and 14.2%.

Quality varies between 3 and 9, with a median of 6 and a mean of 5.878.

Univariate Analysis

What is the structure of your dataset?

There are 4898 sampled wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). All features are numeric except quality, which is an integer. Quality has a range of 3 to 9, with a median of 6 and mean of 5.878.

What is/are the main feature(s) of interest in your dataset?

The main feature is quality. I am looking to find how the 11 input variables influence the quality of a wine, so I can predict the quality of a wine based on its chemical features.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From the dataset summary, I can start with the hint that high amounts of volatile acidity lead to an unpleasant taste - therefore lowering the quality score. Citric acid is said to add ‘freshness’ and flavor to wines - so may be a good indicator of quality as well. Sulfur Dioxide content may also be a good indicator of quality, as it is detectable in high concentrations and may be unpleasant.

Did you create any new variables from existing variables in the dataset?

I created total acidity, which is the sum of fixed acidity, volatile acidity and citric acid, and is measured in g/L. This will be useful in the bivariate and multivariate analysis where I investigate whether ratios of acids and other features affect quality. I also created a factor data type from the quality variable.

I also created a ratio of free sulfur dioxide to total sulfur dioxide. This will also be investigated with relation to quality and sulphate levels.
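A minimal sketch of how these derived variables could be built is shown below. It uses the first ten rows printed in the `str()` output rather than the full file, and the new column names (`total.acidity`, `so2.ratio`, `quality.factor`) are assumptions chosen for illustration, not necessarily the names used in the original analysis.

```r
# First ten rows of the relevant columns, copied from the str() output in this report.
wines <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid          = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  quality              = rep(6L, 10)
)

# Total acidity (g/L): sum of the three acid measurements.
wines$total.acidity <- with(wines, fixed.acidity + volatile.acidity + citric.acid)

# Share of total sulfur dioxide that is present as free sulfur dioxide.
wines$so2.ratio <- with(wines, free.sulfur.dioxide / total.sulfur.dioxide)

# Quality as an ordered factor over the observed 3 to 9 range (column name is hypothetical).
wines$quality.factor <- factor(wines$quality, levels = 3:9, ordered = TRUE)

print(head(wines[, c("total.acidity", "so2.ratio", "quality.factor")], 3))
```

Declaring all levels from 3 to 9 keeps quality grades that are absent from a subset on the axis when the factor is later used for grouping or coloring.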

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

It appears the ratio of citric acid to volatile acidity has local modes at 1 and 2 on the histograms, suggesting that wine makers may be adding these ingredients to reach these ratios. Perhaps this ratio is used to influence the quality of the wine. There also appears to be a spike of results for citric acid at 0.49 g/L. Density appears to fall into some consistent, stepped ranges: high frequencies between 0.991 and 0.994, mid frequencies between 0.994 and 0.996, and lower frequencies between 0.996 and 0.999. I will investigate if these ranges can be attributed to ranges in another feature.

I plotted all features with non-normal or long tailed distributions on a log10 scale. Volatile acidity and residual sugar were found to have bimodal distributions on this scale - I will investigate this in the following analysis.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000

From our correlation coefficients above:

- Fixed acidity is loosely (negatively) correlated with pH.
- Residual sugar is correlated with density, and loosely correlated with total sulfur dioxide and alcohol.
- Chlorides are somewhat correlated to quality and alcohol.
- Free sulfur dioxide is closely correlated to total sulfur dioxide.
- Total sulfur dioxide is correlated to density and negatively correlated to alcohol.
- Density is strongly negatively correlated to alcohol and loosely correlated to quality.
- Alcohol and quality are also loosely correlated.
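To make the comparison with quality easier to read, the quality column of the matrix above can be sorted by absolute value. The sketch below copies those coefficients from the printed output instead of recomputing them; with the full data frame loaded, the same vector could be taken from the `quality` column of the `cor()` result. The object names are illustrative assumptions.

```r
# Pearson correlations with quality, copied from the matrix printed above.
quality.cor <- c(
  fixed.acidity        = -0.113662831,
  volatile.acidity     = -0.194722969,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.097576829,
  chlorides            = -0.209934411,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.174737218,
  density              = -0.307123313,
  pH                   =  0.099427246,
  sulphates            =  0.053677877,
  alcohol              =  0.435574715
)

# Rank by strength of linear association, ignoring the sign.
ranked <- quality.cor[order(abs(quality.cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting on the absolute value treats a strong negative coefficient such as density the same as a strong positive one, which is usually what matters when choosing variables to investigate further.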

Below I look at how the different features plot against quality.

From the density functions on the ggpairs plot I see that linear correlation may not be effective in discovering trends in the data. The density functions show multiple peaks and troughs along multiple axes. This data will reveal most of its important information in the multivariate analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.60    6.90    7.10    7.42    7.40    9.10
## [1] 9.1 6.6 7.4 6.9 7.1

Wines with a quality of 9 have a mean fixed acidity of 7.42 g/L; however, with only 5 wines achieving a quality of 9 this may not be the most reliable of statistics. Looking at the trend of fixed.acidity.mean, it appears the wine with a quality of 9 and a fixed.acidity of 9.1 may be the outlier skewing our trend, or perhaps it is a wine style with a different flavour profile. In general - lower quality wines have a higher fixed acidity.
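The influence of that single wine can be checked directly from the five fixed acidity values printed above. This is a small sketch using only those five values; the variable names are illustrative.

```r
# Fixed acidity (g/L) of the five wines rated 9, copied from the output above.
fixed.acidity.q9 <- c(9.1, 6.6, 7.4, 6.9, 7.1)

# Mean with every wine included, and again without the 9.1 g/L wine.
mean.all     <- mean(fixed.acidity.q9)
mean.trimmed <- mean(fixed.acidity.q9[fixed.acidity.q9 < 9])

# The median is far less sensitive to one extreme value in such a small group.
median.all <- median(fixed.acidity.q9)

print(c(mean.all = mean.all, mean.trimmed = mean.trimmed, median.all = median.all))
```

With so few wines in the group, comparing medians across quality levels may give a steadier trend line than comparing means.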

Higher quality wines (quality > 6) appear to have a lower volatile acidity. It also appears from the box plot and from the scatter plot that there may be some grouping in this relationship that could be explored with the multivariate plots.

From the above plots we have an anomaly where there is a high proportion of wines with a citric acid value between 0.49 and 0.5 g/L. I will isolate that value by color in a multivariate plot to investigate correlations with other properties as well. The mean by quality plot shows that quality tends to increase with increasing citric acid; however, the citric acid mean for wines with a quality of 3 is mid range.

In this plot and the plot above, it appears the higher quality wines fall within a smaller range than the lower quality wines.

## Warning: Removed 20 rows containing missing values (geom_point).

Residual sugar shows a definite split along quality, with groups focused at 1.5 g/L and 11 g/L.

Chlorides appear to drop off as quality increases - maybe a parabolic relationship, as there appears to be a peak around qualities of 5.

The plot for quality against the free sulfur dioxide/total sulfur dioxide ratio shows a potential for a linear relationship.

At higher qualities, densities appear to split into groups at 0.992 g/mL and 0.997 g/mL. This will be further investigated in the multivariate plots section.

Quality appears to increase with increasing pH.

Sulphates don’t appear to have a very high correlation to quality.

It looks like our judges like higher alcohol wines. Whether this is correlated to another factor will soon be investigated in the multivariate plots section.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

It appears that as quality increases - the wine focuses on specific values for total sulfur dioxide and acidity. There are also some features that become bimodal with higher quality wines - such as residual sugar. This may be because of differing preferences, or because there are different styles within this type of wine, which may taste different, but all may be perceived as high quality. Fixed acidity focuses to two different values, volatile acidity focuses at high quality, and citric acid sits within a certain range at high ratings. Residual sugar for wines with high quality sits at either near zero sugar or 10 g/dm^3. Chloride levels are low for high quality wines. Free sulfur dioxide focuses to two different values, as does total sulfur dioxide. Density focuses to two different values. pH focuses to around 3.3. Sulphates sit at 2 specific ranges. Alcohol at higher ratings sits around 10.3 and 12.5 percent. From our correlation matrix, it appears quality is most closely correlated to alcohol, density and chlorides.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Alcohol and density are highly (negatively) correlated. This relationship is expected, as density is used to measure final alcohol content in wines. Residual sugar, alcohol and acetic acid (volatile acidity) also have an interesting relationship - since alcohol and acetic acid are the products of metabolization of sugar by the yeast.

What was the strongest relationship you found?

Density and residual sugar had the highest correlation, at 0.84. The relationship is clear in the ggpairs plot.

Multivariate Plots Section

I’ve plotted a scatter plot of residual sugar vs. density, colorizing the points by quality using a diverging color scheme. I chose these variables first due to their high correlation. There is a clear divergence between higher and lower quality wines.

Chlorides and alcohol vs. density also show divergent clusters in high vs. low quality wines.

Not seeing a very clear divergence or pattern for chlorides and total.sulfur.dioxide, or for density, chlorides and quality.

Not seeing a clear pattern or correlation with these variables either.

The bar chart is really helpful here in showing quality thresholds for each variable. For example - wines with free sulfur dioxide levels above 110 mg/L are almost exclusively low quality. With the histograms we see the frequency of occurrences within these ranges.
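The proportions behind such a bar chart can be tabulated by cutting a variable into ranges and counting quality grades within each range. The data frame below is a small hypothetical example invented purely to show the mechanics (it is not taken from the wine dataset), and the bin edges are likewise an assumption.

```r
# Hypothetical toy data for illustration only; not rows from the wine dataset.
toy <- data.frame(
  free.sulfur.dioxide = c(10, 12, 30, 35, 40, 45, 60, 120, 130, 150),
  quality             = c(4, 5, 6, 6, 7, 6, 5, 3, 4, 3)
)

# Cut the variable into ranges (assumed bin edges), then count quality within each range.
bins   <- cut(toy$free.sulfur.dioxide, breaks = c(0, 15, 110, 300))
counts <- table(bin = bins, quality = toy$quality)

# margin = 1 makes each row (range) sum to 1, giving the share of each quality grade.
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Normalizing within each range rather than over the whole table is what lets sparsely populated ranges, such as very high sulfur dioxide levels, be compared fairly with crowded ones.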

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.974   6.000   9.000

I created a new dataframe of white wines, excluding wines which sit outside the quality thresholds for specific variables. I limited variables to the following ranges:

- free sulfur dioxide between 15 and 110
- fixed acidity below 9
- volatile acidity below 0.6
- total sulfur dioxide between 60 and 210
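A filter of this kind could be written with `subset()`, as in the sketch below. It runs on the first ten rows printed in the `str()` output rather than on the full dataset, and it assumes that the "between" bounds are inclusive, which the text does not specify. The name `wines.trimmed` is illustrative.

```r
# First ten rows of the four thresholded columns, copied from the str() output.
wines <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Keep only wines inside every range (inclusive bounds are an assumption).
wines.trimmed <- subset(
  wines,
  free.sulfur.dioxide >= 15 & free.sulfur.dioxide <= 110 &
    fixed.acidity < 9 &
    volatile.acidity < 0.6 &
    total.sulfur.dioxide >= 60 & total.sulfur.dioxide <= 210
)

# Number of wines dropped by the filter.
print(nrow(wines) - nrow(wines.trimmed))
```

In these ten rows only the two wines with 14 mg/L of free sulfur dioxide fall outside the ranges, which shows how the lower bound can exclude wines as well as the upper one.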

Now in viewing the bar chart and histograms, we can see much clearer patterns. The wines that had chemical amounts exceeding threshold values are excluded.

Above I have plotted residual sugar against density and encoded quality in color, excluding the wines with out of threshold values. It appears this has removed some of the higher variance low quality points, giving us a clearer picture of the relationship.

The clearer picture can also be seen with chlorides vs. sulfur dioxide, chlorides vs. density, and density vs. alcohol.

When residual sugar is plotted vs alcohol the relationship is less obvious, however it can be seen that there is a much higher proportion of high sugar/low alcohol, low quality wines than high sugar / high alcohol wines.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

As total sulfur dioxide increases, quality decreases. This was made easier to view by adding the dimension of chloride content.

The interaction between the chlorides and density with quality was made more apparent by plotting them against each other.

Were there any interesting or surprising interactions between features?

Plotting residual sugar vs. density, and colorizing by quality, a clear divergence appeared when holding either constant. This was seen with alcohol vs. density as well. This is interesting due to the critical interaction of these three variables in the wine making process. As wine ferments, alcohol replaces sugar. Additionally the density of wine is measured during the wine making process to determine alcohol content.

A really interesting find was the quality thresholds in free sulfur dioxide, fixed acidity, volatile acidity and total sulfur dioxide. Above and below certain thresholds for these chemicals, wine quality suffered immensely.


Final Plots and Summary

Plot One

Description One

This distribution is bimodal, and shows a clear difference between a dry vs. a sweet wine.

Plot Two

Description Two

This plot tells an interesting story. For a given amount of residual sugar, lower density wines have higher quality. It is very improbable to have a high quality wine with a density over 0.995. Also, from this plot we see that higher alcohol wines score higher. Proportionally there is a much higher probability of having a high sugar wine if there is a high alcohol content.

Plot Three

Description Three

This plot shows the proportions of different wine qualities for all other variables. The most interesting parts of this plot are the clear quality thresholds in free sulfur dioxide, fixed acidity, volatile acidity and total sulfur dioxide. When the wine moves outside suitable ranges for these chemicals (very high proportions of 3 and 4 quality wines) it could be argued that the wine is ruined.


Reflection

This dataset contains nearly 5000 white wines of the Portuguese “Vinho Verde” variety. By observing the histograms of each of the 12 variables, and then plotting each of the eleven input variables against the wine quality, I was able to isolate the variables which were most correlated to quality - alcohol, density and residual sugar. By plotting each variable in proportion bar charts along their respective ranges, I was able to find ranges of chemicals which have very high proportions of poor quality wines. For wine makers, this could be very useful information - if you see a batch of wine testing outside those threshold ranges, it is time to review your recipe or process, for that wine will most likely be of poor quality. As someone picking out a bottle of wine, the only information readily available is the alcohol content, which is printed on the bottle. Residual sugar or sweetness of the wine may be hinted at on the label, as well as whether sulphates were added. From my analysis, a non-expert in wine may find better luck with higher alcohol wines. If one were to require a sweeter wine, a higher proportion of high quality wines exist with higher alcohol contents.

I found difficulty in the bivariate plots section, since the level of correlation between most data was quite low. It was necessary to add other variables to see any type of pattern. I would have liked to observe the variables over limited ranges, and see if any patterns could be picked out using that technique. It was a good learning experience to plot all variables individually in the univariate and bivariate plot sections, however, not all information was useful. In the future I will rely more on statistical information to guide my investigations into correlation. I will say that simply finding correlations between variables is not enough as there could be cyclical or non-linear interactions that a simple correlation test will not find. I was very happy to find the relationship between density, residual sugar and quality. An interesting future investigation would be to find the concentration at which different acidities, chlorides and sulphates have a discernable taste, and then colorize the wines in excess of those values to see where they occur on scatter plots of different variables. I think another interesting investigation would be to compare dry vs. sweet wines by splitting the residual sugar variable into two groups.
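The dry vs. sweet comparison suggested above could start from a simple two-way split of residual sugar, sketched below on the first ten values from the `str()` output. The 5 g/L cutoff and the group labels are hypothetical placeholders chosen to sit between the two log-scale peaks noted earlier; they are not a winemaking definition of dry or sweet.

```r
# First ten residual sugar values (g/L), copied from the str() output in this report.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Hypothetical cutoff between the two groups; adjust after inspecting the histogram.
sugar.cutoff <- 5

# Two labelled groups: at or below the cutoff, and above it.
sugar.group <- cut(
  residual.sugar,
  breaks = c(0, sugar.cutoff, Inf),
  labels = c("lower sugar", "higher sugar")
)

print(table(sugar.group))
```

Once such a grouping column exists, the earlier quality summaries and scatter plots could be repeated separately for each group to see whether the quality thresholds differ between them.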